1 Learning Objectives

2 Data on the Web

Note: Do not download thousands of HTML files from a website to parse; the admins might block you if you send too many requests. Download each file once, in a code chunk separate from the code that manipulates it.

3 To Scrape or Not to Scrape

3.1 Robots.txt

  • Websites use a standard approach to identify their “desires” or preferences for scraping their sites
    • They want to allow search engines to find them
    • They want to protect their “proprietary” data and privacy information
  • There is typically a file called robots.txt which can be accessed directly as URL/robots.txt
  • See https://support.google.com/webmasters/answer/6062608?hl=en for background
  • A site that is okay with scraping everything has the following:

    User-agent: *
    Disallow:

  • A site that is not okay with any scraping has:

    User-agent: *
    Disallow: /

  • Search for “user-agent”:
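
  • A minimal sketch of reading these rules in R, assuming the robots.txt file has already been read into a character vector (for example with readLines()). The helper name disallowed_for_all() is hypothetical, and the parsing is deliberately simplified: it only collects Disallow paths for the wildcard agent and ignores Allow rules and wildcards inside paths.

```r
# Hypothetical helper (not part of any package): collect the Disallow paths
# that apply to every crawler, i.e. the "User-agent: *" group.
disallowed_for_all <- function(robots_lines) {
  lines <- trimws(sub("#.*$", "", robots_lines))    # strip comments and padding
  lines <- lines[nzchar(lines)]
  field <- tolower(trimws(sub(":.*$", "", lines)))  # text before the first colon
  value <- trimws(sub("^[^:]*:", "", lines))        # text after the first colon

  applies <- FALSE     # are we inside a group that covers "*"?
  prev_agent <- FALSE  # was the previous line also a User-agent line?
  paths <- character(0)
  for (i in seq_along(lines)) {
    if (field[i] == "user-agent") {
      is_star <- value[i] == "*"
      # consecutive User-agent lines share one group of rules
      applies <- if (prev_agent) applies || is_star else is_star
      prev_agent <- TRUE
    } else {
      prev_agent <- FALSE
      if (field[i] == "disallow" && applies && nzchar(value[i])) {
        paths <- c(paths, value[i])
      }
    }
  }
  paths
}

open_site   <- c("User-agent: *", "Disallow:")
closed_site <- c("User-agent: *", "Disallow: /")

open_paths   <- disallowed_for_all(open_site)    # empty: nothing is off limits
closed_paths <- disallowed_for_all(closed_site)  # "/": the whole site is off limits
```

  • An empty result means the site places no restrictions on generic crawlers; a result of "/" means the whole site is off limits.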

3.2 General Guidelines for scraping

  • Scraping can cause significant load on a web site’s servers and services
  • To help scrapers avoid being denied access, several sites offer suggested guidelines
  • One example is Best Practices
  • Scrapers wanting to violate TOS may use multiple methods to disguise their activity e.g., proxy servers
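
  • One way to keep the load on a server low is to download each page at most once and to pause between requests. The sketch below assumes a hypothetical helper, fetch_once(), that takes the downloading function as an argument; the file-naming scheme and the two-second default delay are illustrative choices, not recommendations from any particular site.

```r
# Hypothetical helper: download each URL at most once and pause between requests.
# `fetch` is any function that takes a URL and returns the page as text.
fetch_once <- function(urls, fetch, cache_dir = tempdir(), delay = 2) {
  # Turn each URL into a safe file name (an assumption of this sketch)
  paths <- file.path(cache_dir, paste0(gsub("[^A-Za-z0-9]+", "_", urls), ".html"))
  for (i in seq_along(urls)) {
    if (!file.exists(paths[i])) {
      writeLines(fetch(urls[i]), paths[i])  # hit the server only on a cache miss
      Sys.sleep(delay)                      # wait before the next request
    }
  }
  paths
}

# Stand-in for a real downloader such as
# function(u) paste(readLines(u, warn = FALSE), collapse = "\n")
calls <- 0
fake_fetch <- function(url) {
  calls <<- calls + 1
  "<html><body>demo</body></html>"
}

demo_dir <- file.path(tempdir(), "scrape_cache_demo")
unlink(demo_dir, recursive = TRUE)
dir.create(demo_dir, showWarnings = FALSE)

urls  <- c("https://example.com/a", "https://example.com/b")
paths <- fetch_once(urls, fake_fetch, cache_dir = demo_dir, delay = 0)
paths <- fetch_once(urls, fake_fetch, cache_dir = demo_dir, delay = 0)  # served from disk
calls  # each URL was requested only once
```

  • The saved files can then be passed to read_html() as often as needed without contacting the site again.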

4 Cascading Style Sheets (CSS)

4.1 The Web Site Document Object Model

  • A Web page is a document which can be either displayed in the browser window or as the HTML source; it is the same document in both cases.
  • The Document Object Model (DOM) is a programming interface for HTML and XML documents.
  • It represents the page so programs can change the document structure, style, and content.
  • The DOM represents the document as nodes and objects so programming languages can connect to the page.
  • Every object located within a document is a node of some kind.
    • In an HTML document, an object can be an element node but also a text node or attribute node.
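
  • A small sketch of these node types in R, using a made-up one-line page. It assumes the rvest and xml2 packages are installed; xml_contents() and xml_type() come from xml2.

```r
library(rvest)
library(xml2)

# A made-up page: one paragraph containing text, a nested element, and an attribute
page <- read_html('<html><body><p id="intro">Hello <b>world</b></p></body></html>')

p_node <- html_node(page, "p")              # the <p> element node
child_types <- xml_type(xml_contents(p_node))
child_types                                 # "text" "element"

p_id <- html_attr(p_node, "id")             # value of the id attribute
p_id                                        # "intro"
```

  • The paragraph has two children: the text node "Hello " and the element node <b>. The id is stored as an attribute of the <p> element.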


4.2 CSS Selectors

  • We have to know a little bit about CSS to understand how to extract individual elements from a website.

  • CSS is a formatting language based on “selectors” used to control how HTML files should look. Nearly every website is formatted with CSS.

  • Here is some example CSS:

    h3 {
      color: red;
      font-style: italic;
    }
    
    footer div.alert {
      display: none;
    }
  • The part before the curly braces is called a selector. It corresponds to HTML tags.
    • Specifically, for the two examples above, they correspond to:

      <h3>Some text</h3>
      
      <footer>
      <div class="alert">More text</div>
      </footer>
  • The code inside the curly braces is a set of properties.
    • The h3 properties say to make the h3 headers red and in italics.
    • The second CSS chunk says all <div> tags of class "alert" in the <footer> should be hidden.
  • CSS applies the same properties to the same selectors. So every time we use h3, it will result in the h3 styling of red and italicized text.
  • We can use CSS selectors to identify the elements of a website we are interested in.
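
  • As a sketch of how the same selectors pick out elements in R, the snippet below applies h3 and footer div.alert to a small made-up page with rvest. The extra <div> outside the footer is an addition for illustration, to show that the second selector only matches inside <footer>.

```r
library(rvest)

# Made-up page based on the HTML fragment above, plus one alert outside the footer
page <- read_html('
<html><body>
<h3>Some text</h3>
<div class="alert">Not in the footer</div>
<footer>
<div class="alert">More text</div>
</footer>
</body></html>')

h3_text    <- html_text(html_nodes(page, "h3"))
alert_text <- html_text(html_nodes(page, "footer div.alert"))

h3_text     # "Some text"
alert_text  # "More text"
```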

4.3 Common HTML elements

  • When you scrape an HTML document and convert it into text, you will see some common HTML tags that delineate the nodes in the document, define its internal structure, and convey formatting information

    `<html>` | At the start and end of an HTML document
    `<head>` | Header information
    `<title> website title </title>` | Website title
    `<body>` | Before and after all the content
    `<div> ... </div>` | Divide page content into sections and apply styles
    `<h?> heading </h?>` | Heading (h1 for largest to h6 for smallest)
    `<p> paragraph </p>` | Paragraph of text
    `<a href="url"> link name </a>` | Anchor with a link to another page or website
    `<img src="filename.jpg">` | Show an image
    `<ul> <li> list </li> </ul>` | Unordered, bullet-point list
    `<b> bold </b>` | Make text between tags bold
    `<i> italic </i>` | Make text between tags italic
    `<br>` | Line break (force a new line)
    `<span style="color:red"> red </span>` | Use CSS style to change text colour
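
  • A short sketch showing how a few of these tags look to rvest once a page is parsed. The page content and the example.com URL are made up for illustration.

```r
library(rvest)

# Made-up page using several of the tags listed above
page <- read_html('
<html>
<head><title>Demo</title></head>
<body>
<h1>Films</h1>
<p>See <a href="https://example.com/list">the list</a>.</p>
<ul><li>One</li><li>Two</li></ul>
</body>
</html>')

page_title <- html_text(html_node(page, "title"))        # text of <title>
link_urls  <- html_attr(html_nodes(page, "a"), "href")   # href of every <a>
list_items <- html_text(html_nodes(page, "li"))          # text of every <li>

page_title  # "Demo"
link_urls   # "https://example.com/list"
list_items  # "One" "Two"
```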

5 Using SelectorGadget with Chrome

5.1 IMDB Example

  • Suppose we wanted to get the top 100 movies of all time from IMDB. The web page is very unstructured:

    https://www.imdb.com/list/ls055592025/

     

  • If we click on the ranking of The Godfather, the “1” turns green (indicating what we have selected).

     

  • The “.text-primary” is the selector associated with the “1” we clicked on. It appears in the SelectorGadget box in the bottom right.

  • Everything highlighted in yellow also has the “.text-primary” selector associated with it.

  • We will also want the name of the movie. So if we click on that we get the selector associated with both the rank and the movie name: “a , .text-primary”.

     

  • But we also got a lot of stuff we don’t want (in yellow). If we click one of the yellow items we don’t want, it turns red. This indicates we don’t want to select it.

     

  • Only the ranking and the name remain, which are under the selector “.lister-item-header a , .text-primary”.

  • It’s important to visually inspect the selected elements throughout the whole HTML file. SelectorGadget doesn’t always get all of what you want, or it sometimes gets too much.
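
  • The comma in a selector means “either of these”. A minimal sketch on a made-up fragment that mimics the list layout shows why the broad selector picks up too much and the narrowed one does not. The “Some Director” link is a hypothetical stand-in for the unwanted yellow items.

```r
library(rvest)

# Made-up fragment imitating one entry of the list page
page <- read_html('
<div class="lister-item">
  <h3 class="lister-item-header">
    <span class="lister-item-index unbold text-primary">1.</span>
    <a href="/title/tt0068646/">The Godfather</a>
  </h3>
  <p><a href="/name/x/">Some Director</a></p>
</div>')

# Broad selector: every link plus every .text-primary element
broad  <- html_text(html_nodes(page, "a , .text-primary"))
# Narrowed selector: only links inside the header, plus .text-primary
narrow <- html_text(html_nodes(page, ".lister-item-header a , .text-primary"))

broad   # "1." "The Godfather" "Some Director"
narrow  # "1." "The Godfather"
```

  • Matches come back in document order, which is why ranks and titles alternate in the result.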

5.2 Exercise:

  • What selector can we use to get just the genres of each film, the metacritic score, and the IMDB rating?

6 Chrome developer tools:

7 The rvest Package

7.1 Key functions/Steps enabled by rvest

  • Create an internal html document from a url, a file on disk or a string containing html with read_html().
  • Select parts of a document using CSS selectors: html_nodes(doc, "table td")
  • Extract components with html_name() (the name of the tag), html_text() (all text inside the tag), html_attr() (contents of a single attribute) and html_attrs() (all attributes).
  • You can also work with XML files: parse with read_xml() from the xml2 package (the older rvest function xml() is deprecated), select nodes with html_nodes(), then extract components using xml_attr(), xml_attrs(), xml_text() and xml_name().
  • Parse tables into data frames with html_table().
  • Other special use functions:
    • write_html() or write_xml() to save the HTML data to disk
    • Extract, modify and submit forms with html_form(), set_values() and submit_form().
    • Detect and repair encoding problems with guess_encoding() and repair_encoding().
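
  • A quick sketch of the extraction functions on a made-up one-row table, so the difference between html_name(), html_text(), html_attr() and html_attrs() is visible side by side.

```r
library(rvest)

# Made-up table with two cells; only the first cell has attributes
page  <- read_html('<table><tr><td class="num" id="c1">1</td><td>The Godfather</td></tr></table>')
cells <- html_nodes(page, "table td")

cell_names   <- html_name(cells)           # tag name of each node
cell_text    <- html_text(cells)           # text inside each node
cell_class   <- html_attr(cells, "class")  # one attribute; NA where it is absent
first_attrs  <- html_attrs(cells[[1]])     # all attributes of the first cell

cell_names   # "td" "td"
cell_text    # "1" "The Godfather"
cell_class   # "num" NA
first_attrs  # named vector with class = "num" and id = "c1"
```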

7.2 Use the rvest package to extract elements from HTML files.

```r
library(rvest)
```
  • Use read_html() to save an HTML file to a variable. The variable will be an “xml_document” object

    html_obj <- read_html("https://www.imdb.com/list/ls055592025/")
    html_obj
    ## {html_document}
    ## <html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
    ## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
    ## [2] <body id="styleguide-v2" class="fixed">\n            <img height="1" widt ...
    class(html_obj)
    ## [1] "xml_document" "xml_node"
  • XML stands for “Extensible Markup Language”. It’s a markup language (like HTML and Markdown), useful for representing data.
  • rvest will store the HTML file as an XML object.

  • We can use html_nodes() and the selectors we found in the previous section to get the elements we want.

  • Insert the found selectors as the value for the css argument. This will produce an object of class “xml_nodeset”

    ranking_elements <- html_nodes(html_obj, css = ".lister-item-header a , .text-primary")
    head(ranking_elements)
    ## {xml_nodeset (6)}
    ## [1] <span class="lister-item-index unbold text-primary">1.</span>
    ## [2] <a href="/title/tt0068646/?ref_=ttls_li_tt">The Godfather</a>
    ## [3] <span class="lister-item-index unbold text-primary">2.</span>
    ## [4] <a href="/title/tt0111161/?ref_=ttls_li_tt">The Shawshank Redemption</a>
    ## [5] <span class="lister-item-index unbold text-primary">3.</span>
    ## [6] <a href="/title/tt0108052/?ref_=ttls_li_tt">Schindler's List</a>
  • To extract the text inside the obtained nodes, use html_text():
  • This produces a character vector (here length 200)

    ranking_text <- html_text(ranking_elements)
    head(ranking_text)
    ## [1] "1."                       "The Godfather"           
    ## [3] "2."                       "The Shawshank Redemption"
    ## [5] "3."                       "Schindler's List"
  • After you do this, tidy the data using your tidyverse tools.
    • Need to convert a vector with character numbers and names into a tibble with two columns, one numeric and one character
    • Turn the ranking_text vector into a tibble of one column (nrow() is 200 in this example)
      • Add a column with the row_number()
      • Add a column to state if the row number is even or odd
      • Add a column with a pair of numbers for each movie tied to its rank, e.g., 1,1, 2,2, 3,3,…100,100
    • Now the tibble is ready to be “tidied”
      • select the columns less the row number, which has served its purpose
      • pivot_wider() to break out the ranks and movie titles with names_from = iseven and values_from = text
      • Select only the columns named “TRUE” and “FALSE” and rename them as part of the select to “Rank” and “movie”
      • use parse_number to convert Rank to a number
      • assign to a new data frame
  • Voila

    library(tidyverse)

    tibble(text = ranking_text) %>%
      # Tag each row with its position, its parity, and the movie it belongs to
      mutate(rownum = row_number(),
             iseven = rownum %% 2 == 0,
             movie = rep(1:100, each = 2)) %>%
      select(-rownum) %>%
      # Odd rows (iseven is FALSE) hold ranks; even rows (TRUE) hold titles
      pivot_wider(names_from = iseven, values_from = text) %>%
      # Drop the helper id and rename the two new columns
      select(-movie, "Rank" = "FALSE", movie = "TRUE") %>%
      mutate(Rank = parse_number(Rank)) ->
      movierank
    movierank
    ## # A tibble: 100 x 2
    ##     Rank movie                          
    ##    <dbl> <chr>                          
    ##  1     1 The Godfather                  
    ##  2     2 The Shawshank Redemption       
    ##  3     3 Schindler's List               
    ##  4     4 Raging Bull                    
    ##  5     5 Casablanca                     
    ##  6     6 Citizen Kane                   
    ##  7     7 Gone with the Wind             
    ##  8     8 The Wizard of Oz               
    ##  9     9 One Flew Over the Cuckoo's Nest
    ## 10    10 Lawrence of Arabia             
    ## # … with 90 more rows
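
  • Because ranks and titles strictly alternate here, a shorter alternative is to split the vector by position. This is a sketch on a three-movie stand-in vector (ranking_text_demo and movierank_demo are illustrative names); it only works when no element is missing, which is not the case in the bigger example later on.

```r
library(tibble)
library(readr)

# Stand-in for the first six elements of ranking_text
ranking_text_demo <- c("1.", "The Godfather",
                       "2.", "The Shawshank Redemption",
                       "3.", "Schindler's List")

movierank_demo <- tibble(
  Rank  = parse_number(ranking_text_demo[c(TRUE, FALSE)]),  # odd positions
  movie = ranking_text_demo[c(FALSE, TRUE)]                 # even positions
)
movierank_demo
```

  • The logical index c(TRUE, FALSE) is recycled along the vector, so it keeps every other element.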

7.3 Exercise: Extract the directors and the names of each film separately and combine them into a tibble.

7.4 Bigger example using rvest

  • Let’s try and get the name, rank, year, genre, and metascore for each movie:

     

  • We already have all the HTML stored in html_obj so we don’t need to scrape again
  • Use SelectorGadget to get the right CSS selectors
  • Copy the CSS selectors and use html_nodes() to create a text vector

    dataobj <- html_nodes(html_obj, 
                          css = ".favorable , .genre, .unbold, 
                                .lister-item-header a")
    datatext <- html_text(dataobj)
  • Look at the data. Look for unwanted data or any patterns you can use to tidy the data.
  • We have a lot of cleaning to do. Note the first 132 elements we got we didn’t even want:

    #view(datatext)
    length(datatext)
    ## [1] 628
    datatext[131:136]
    ## [1] "\n        Suicide\n            (17)\n    "
    ## [2] "\n        1930s\n            (16)\n    "  
    ## [3] "1."                                       
    ## [4] "The Godfather"                            
    ## [5] "(1972)"                                   
    ## [6] "\nCrime, Drama            "
    length(datatext)-132 # we are missing four elements somewhere
    ## [1] 496
  • Notice the rankings are always of the form "\\d+\\.".
    • We can use this pattern and a cumulative sum (like we did with rep() before) to figure out which movie each element belongs to.
    • This is necessary because we know we are missing four elements somewhere (e.g., “The Great Dictator” doesn’t have a metacritic score).
  • Similar steps
    • Create a tibble out of the character vector
    • Add a variable for whether the row is a ranking and assign back to the data frame
    • Add a variable that is the cumulative sum of the rankings
    • Use this to filter out the initial (pre-rankings) rows that are all 0 and assign back to the data frame

      datadf <- tibble(text = datatext)
      
      datadf %>%
        mutate(ismovierank = str_detect(text, "^\\d+\\.$")) ->
        datadf
      #view(datadf)
      #
      ## Check to make sure you have 100 ranks
      sum(datadf$ismovierank)
      ## [1] 100
      ## get movie numbers and remove non-movie elements:
      datadf %>%
        mutate(movienum = cumsum(ismovierank)) %>%
        filter(movienum > 0) ->
        datadf
      
      datadf
      ## # A tibble: 496 x 3
      ##    text                         ismovierank movienum
      ##    <chr>                        <lgl>          <int>
      ##  1 "1."                         TRUE               1
      ##  2 "The Godfather"              FALSE              1
      ##  3 "(1972)"                     FALSE              1
      ##  4 "\nCrime, Drama            " FALSE              1
      ##  5 "100        "                FALSE              1
      ##  6 "2."                         TRUE               2
      ##  7 "The Shawshank Redemption"   FALSE              2
      ##  8 "(1994)"                     FALSE              2
      ##  9 "\nDrama            "        FALSE              2
      ## 10 "80        "                 FALSE              2
      ## # … with 486 more rows
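
  • To see why the cumulative sum still assigns elements correctly when a movie is missing a value, here is a sketch on a made-up vector in which the second film has no score. The film names and values are invented for illustration.

```r
library(stringr)

# Made-up scrape: one junk element, then three films; "Film B" has no score
toy <- c("junk",
         "1.", "Film A", "(1999)", "90",
         "2.", "Film B", "(2001)",
         "3.", "Film C", "(2010)", "75")

is_rank   <- str_detect(toy, "^\\d+\\.$")  # TRUE only for "1.", "2.", "3."
movie_num <- cumsum(is_rank)               # increases by one at each rank
movie_num
# 0 1 1 1 1 2 2 2 3 3 3 3
```

  • The junk element gets 0 and can be filtered out, and every remaining element carries the number of the most recent rank, regardless of how many elements each film has.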
  • Now we want to create variables like ismovierank for each data element to identify which rows they are in
    • name: We can use the movierank$movie variable we created before to see which rows are movie names

      datadf %>%
        mutate(isname = text %in% movierank$movie) ->
        datadf
      
      ## make sure we have 100 movies:
      sum(datadf$isname)
      ## [1] 100
      datadf
      ## # A tibble: 496 x 4
      ##    text                         ismovierank movienum isname
      ##    <chr>                        <lgl>          <int> <lgl> 
      ##  1 "1."                         TRUE               1 FALSE 
      ##  2 "The Godfather"              FALSE              1 TRUE  
      ##  3 "(1972)"                     FALSE              1 FALSE 
      ##  4 "\nCrime, Drama            " FALSE              1 FALSE 
      ##  5 "100        "                FALSE              1 FALSE 
      ##  6 "2."                         TRUE               2 FALSE 
      ##  7 "The Shawshank Redemption"   FALSE              2 TRUE  
      ##  8 "(1994)"                     FALSE              2 FALSE 
      ##  9 "\nDrama            "        FALSE              2 FALSE 
      ## 10 "80        "                 FALSE              2 FALSE 
      ## # … with 486 more rows

    • year: note the years are surrounded by parentheses, so we can use regex to add a variable that flags which rows are years:

    datadf %>%
      mutate(isyear = str_detect(text, "\\(\\d+\\)")) ->
      datadf
    
    ## make sure it is 100
    sum(datadf$isyear)
    ## [1] 100
    datadf
    ## # A tibble: 496 x 5
    ##    text                         ismovierank movienum isname isyear
    ##    <chr>                        <lgl>          <int> <lgl>  <lgl> 
    ##  1 "1."                         TRUE               1 FALSE  FALSE 
    ##  2 "The Godfather"              FALSE              1 TRUE   FALSE 
    ##  3 "(1972)"                     FALSE              1 FALSE  TRUE  
    ##  4 "\nCrime, Drama            " FALSE              1 FALSE  FALSE 
    ##  5 "100        "                FALSE              1 FALSE  FALSE 
    ##  6 "2."                         TRUE               2 FALSE  FALSE 
    ##  7 "The Shawshank Redemption"   FALSE              2 TRUE   FALSE 
    ##  8 "(1994)"                     FALSE              2 FALSE  TRUE  
    ##  9 "\nDrama            "        FALSE              2 FALSE  FALSE 
    ## 10 "80        "                 FALSE              2 FALSE  FALSE 
    ## # … with 486 more rows
    • genre: each genre begins with a newline character, so again we can use regex to identify those rows:

      datadf %>%
        mutate(isgenre = str_detect(text, "^\\n")) ->
        datadf
      
      ## make sure it is 100
      sum(datadf$isgenre)
      ## [1] 100
    • metacritic score: the only rows left should be the metacritic scores:
      • check the count
      • create the new variable as TRUE for any row that is none of the other types
      datadf %>%
        group_by(ismovierank, isname, isyear, isgenre) %>%
        count() # we are missing four as we suspected
      ## # A tibble: 5 x 5
      ## # Groups:   ismovierank, isname, isyear, isgenre [5]
      ##   ismovierank isname isyear isgenre     n
      ##   <lgl>       <lgl>  <lgl>  <lgl>   <int>
      ## 1 FALSE       FALSE  FALSE  FALSE      96
      ## 2 FALSE       FALSE  FALSE  TRUE      100
      ## 3 FALSE       FALSE  TRUE   FALSE     100
      ## 4 FALSE       TRUE   FALSE  FALSE     100
      ## 5 TRUE        FALSE  FALSE  FALSE     100
      datadf %>%
        mutate(ismeta = !ismovierank & !isname & !isyear & !isgenre) ->
        datadf
      
      datadf
      ## # A tibble: 496 x 7
      ##    text                        ismovierank movienum isname isyear isgenre ismeta
      ##    <chr>                       <lgl>          <int> <lgl>  <lgl>  <lgl>   <lgl> 
      ##  1 "1."                        TRUE               1 FALSE  FALSE  FALSE   FALSE 
      ##  2 "The Godfather"             FALSE              1 TRUE   FALSE  FALSE   FALSE 
      ##  3 "(1972)"                    FALSE              1 FALSE  TRUE   FALSE   FALSE 
      ##  4 "\nCrime, Drama           … FALSE              1 FALSE  FALSE  TRUE    FALSE 
      ##  5 "100        "               FALSE              1 FALSE  FALSE  FALSE   TRUE  
      ##  6 "2."                        TRUE               2 FALSE  FALSE  FALSE   FALSE 
      ##  7 "The Shawshank Redemption"  FALSE              2 TRUE   FALSE  FALSE   FALSE 
      ##  8 "(1994)"                    FALSE              2 FALSE  TRUE   FALSE   FALSE 
      ##  9 "\nDrama            "       FALSE              2 FALSE  FALSE  TRUE    FALSE 
      ## 10 "80        "                FALSE              2 FALSE  FALSE  FALSE   TRUE  
      ## # … with 486 more rows
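
  • The classification rules above can be checked on a single movie’s worth of strings. This is a sketch using the five values shown in the output for The Godfather; the variable names are illustrative.

```r
library(stringr)

# One movie's worth of scraped strings, copied from the output above
sample_text <- c("1.", "The Godfather", "(1972)",
                 "\nCrime, Drama            ", "100        ")

is_rank  <- str_detect(sample_text, "^\\d+\\.$")   # digits followed by a period
is_year  <- str_detect(sample_text, "\\(\\d+\\)")  # digits inside parentheses
is_genre <- str_detect(sample_text, "^\\n")        # starts with a newline
is_name  <- sample_text %in% c("The Godfather")    # matched against known titles
is_meta  <- !is_rank & !is_year & !is_genre & !is_name  # whatever is left

which(is_meta)  # 5: the metacritic score is the only unclaimed element
```

  • Each element is claimed by exactly one rule, which is what the group_by() and count() check above verifies for the full data.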
  • Let’s create a key variable for the data in text using dplyr::case_when() and then use pivot_wider() to spread them:

    datadf %>%
      mutate(key = case_when(ismovierank ~ "rank",
                             isname ~ "name",
                             isyear ~ "year",
                             isgenre ~ "genre",
                             ismeta ~ "metacritic")) %>%
      select(key, text, movienum) %>%
      # spread(key = "key", value = "text") ->
      pivot_wider(names_from = key, values_from = text) ->
      datawide
    
    datawide
    ## # A tibble: 100 x 6
    ##    movienum rank  name               year   genre                    metacritic 
    ##       <int> <chr> <chr>              <chr>  <chr>                    <chr>      
    ##  1        1 1.    The Godfather      (1972) "\nCrime, Drama        … "100      …
    ##  2        2 2.    The Shawshank Red… (1994) "\nDrama            "    "80       …
    ##  3        3 3.    Schindler's List   (1993) "\nBiography, Drama, Hi… "94       …
    ##  4        4 4.    Raging Bull        (1980) "\nBiography, Drama, Sp… "89       …
    ##  5        5 5.    Casablanca         (1942) "\nDrama, Romance, War … "100      …
    ##  6        6 6.    Citizen Kane       (1941) "\nDrama, Mystery      … "100      …
    ##  7        7 7.    Gone with the Wind (1939) "\nDrama, History, Roma… "97       …
    ##  8        8 8.    The Wizard of Oz   (1939) "\nAdventure, Family, F… "100      …
    ##  9        9 9.    One Flew Over the… (1975) "\nDrama            "    "83       …
    ## 10       10 10.   Lawrence of Arabia (1962) "\nAdventure, Biography… "100      …
    ## # … with 90 more rows
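
  • A sketch of what pivot_wider() does with a movie that lacks a metacritic score, using a made-up two-film tibble (toy_long and its values are invented). The missing combination becomes NA instead of shifting other values out of place.

```r
library(tibble)
library(tidyr)

# Made-up long data: the second film has no metacritic row
toy_long <- tibble(
  movienum = c(1, 1, 1, 2, 2),
  key      = c("rank", "name", "metacritic", "rank", "name"),
  text     = c("1.", "Film A", "90", "2.", "Film B")
)

toy_wide <- pivot_wider(toy_long, names_from = key, values_from = text)
toy_wide$metacritic  # "90" NA
```

  • This is why keeping movienum as an identifier matters: it tells pivot_wider() which values belong on the same row.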
  • The data is now “tidy” but not clean. So let’s clean up the remaining variables:
    • Get rid of the new line character with str_replace_all()
    • Get rid of white space in genre with str_squish()
    • Turn metacritic, rank, and year into numbers
    • Get rid of movienum as no longer needed
    • Reassign back to the data frame

      datawide %>%
        mutate(genre = str_replace_all(genre, "\\n", ""),
               genre = str_squish(genre),
               metacritic = parse_number(metacritic),
               rank = parse_number(rank),
               year = parse_number(year),
               movienum=NULL) ->
        datawide
      
      datawide
      ## # A tibble: 100 x 5
      ##     rank name                           year genre                    metacritic
      ##    <dbl> <chr>                         <dbl> <chr>                         <dbl>
      ##  1     1 The Godfather                  1972 Crime, Drama                    100
      ##  2     2 The Shawshank Redemption       1994 Drama                            80
      ##  3     3 Schindler's List               1993 Biography, Drama, Histo…         94
      ##  4     4 Raging Bull                    1980 Biography, Drama, Sport          89
      ##  5     5 Casablanca                     1942 Drama, Romance, War             100
      ##  6     6 Citizen Kane                   1941 Drama, Mystery                  100
      ##  7     7 Gone with the Wind             1939 Drama, History, Romance          97
      ##  8     8 The Wizard of Oz               1939 Adventure, Family, Fant…        100
      ##  9     9 One Flew Over the Cuckoo's N…  1975 Drama                            83
      ## 10    10 Lawrence of Arabia             1962 Adventure, Biography, D…        100
      ## # … with 90 more rows
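
  • The individual cleaning functions can be tried on single values first. In this sketch the inputs are copied from the scraped output above; note that str_squish() on its own already removes the leading newline, since it treats newlines as whitespace, so the str_replace_all() step is a belt-and-braces choice.

```r
library(stringr)
library(readr)

genre_clean <- str_squish("\nCrime, Drama            ")  # trims and collapses whitespace
year_clean  <- parse_number("(1972)")                    # drops the parentheses
meta_clean  <- parse_number("100        ")               # drops trailing spaces
rank_clean  <- parse_number("1.")                        # drops the trailing period

genre_clean  # "Crime, Drama"
year_clean   # 1972
meta_clean   # 100
rank_clean   # 1
```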

7.5 Scraping Whole Tables with html_table()

  • When data is in the form of a table, you can format it more easily with html_table().

  • The Wikipedia article on hurricanes: https://en.wikipedia.org/wiki/Atlantic_hurricane_season

  • This contains many tables which might be a pain to copy and paste into Excel (and we would be prone to error if we tried).

  • Let’s try to automate this procedure using html_table().
  • html_table() makes a few assumptions:
    • No cells span multiple rows
    • Headers are in the first row
  • Save the website HTML once using read_html()

    wikixml <- read_html("https://en.wikipedia.org/wiki/Atlantic_hurricane_season")
  • We’ll extract all of the “table” elements.

    wikidat <- html_nodes(wikixml, "table")
  • Use html_table() to get a list of tables from table elements:

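    # Older rvest versions stop with "Table has inconsistent number of columns"
    # unless ragged rows are padded with NA via fill = TRUE
    tablist <- html_table(wikidat, fill = TRUE)
    class(tablist)
    length(tablist)

    # Look at the first four columns of one table; the index depends on the
    # page layout at the time of scraping, so adjust it as needed
    tablist[[19]] %>%
      select(1:4)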
  • You can clean up, bind, or merge these tables after you have read them in.
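
  • A self-contained sketch of html_table() on a made-up page with two tables, one of which carries the class wikitable. The table contents are invented; the point is that a more specific selector returns only the tables you want, and that the result is a list with one data frame per table.

```r
library(rvest)

# Made-up page: one data table with class "wikitable" and one layout table
page <- read_html('
<table class="wikitable">
  <tr><th>Name</th><th>Value</th></tr>
  <tr><td>a</td><td>1</td></tr>
  <tr><td>b</td><td>2</td></tr>
</table>
<table>
  <tr><th>Note</th></tr>
  <tr><td>layout only</td></tr>
</table>')

all_tables  <- html_table(html_nodes(page, "table"))
wiki_tables <- html_table(html_nodes(page, "table.wikitable"))

length(all_tables)   # 2
length(wiki_tables)  # 1
wiki_tables[[1]]     # two rows, columns Name and Value
```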

7.6 Exercise: The Wikipedia page on the oldest mosques in the world has many tables.

<https://en.wikipedia.org/wiki/List_of_the_oldest_mosques>
  1. Use rvest to read these tables into R to get a list of tables.
  2. Use rvest and SelectorGadget to extract out the category for the table (mentioned in Quran, in northeast Africa, etc).
    1. Check that you have the same number of tables and header values
  3. Use purrr and other functions to clean and tidy the list of tables
    1. Check for missing columns
    2. Check for extra columns, e.g., named NA
    3. Replace century dates with the last year of the century, e.g., 15th becomes 1500
  4. Keep only the building name, the country, and the time it was first built.
  5. Clean the headers
  6. Add the header information to the table data frames
  7. Merge the table data frames together into one data frame.
  • Hint: It’s easier if you use a css selector of "table.wikitable" to get the table rather than just "table". Get to the developer tools in Chrome and play around with the tables.

    The first 15 rows should look like this when you are done:

    ##                    Building      Country   fb               category
    ## 1           Al-Haram Mosque Saudi Arabia <NA> Mentioned in the Quran
    ## 2            Al-Aqsa Mosque    Palestine <NA> Mentioned in the Quran
    ## 3       The Sacred Monument Saudi Arabia <NA> Mentioned in the Quran
    ## 4               Quba Mosque Saudi Arabia  622 Mentioned in the Quran
    ## 5  Mosque of the Companions      Eritrea  610       Northeast Africa
    ## 6      Negash Āmedīn Mesgīd     Ethiopia  620       Northeast Africa
    ## 7       Masjid al-Qiblatayn      Somalia  620       Northeast Africa
    ## 8            Korijib Masjid     Djibouti  630       Northeast Africa
    ## 9   Mosque of Amr ibn al-As        Egypt  641       Northeast Africa
    ## 10      Mosque of Ibn Tulun        Egypt  879       Northeast Africa
    ## 11          Al-Hakim Mosque        Egypt  928       Northeast Africa
    ## 12          Al-Azhar Mosque        Egypt  972       Northeast Africa
    ## 13      Arba'a Rukun Mosque      Somalia 1268       Northeast Africa
    ## 14       Fakr ad-Din Mosque      Somalia 1269       Northeast Africa
    ## 15 Great Mosque of Kairouan      Tunisia  670       Northwest Africa